Abstract

The movement to bring datasets into the scholarly record as first class research
products (validated, preserved, cited, and credited) has been inching forward
for some time, but now the pace is quickening. As data publication venues
proliferate, significant debate continues over formats, processes, and
terminology. Here, we present an overview of data publication initiatives
underway and the current conversation, highlighting points of consensus and
issues still in contention. Data publication implementations differ in a variety of
factors, including the kind of documentation, the location of the documentation
relative to the data, and how the data is validated. Publishers may present the
data as supplemental material to a journal article, with a descriptive "data
paper," or independently. Complicating the situation, different initiatives and
communities use the same terms to refer to distinct but overlapping concepts. For
instance, the term "published" means that the data is publicly available and
citable to virtually everyone, but it may or may not imply that the data has been
peer-reviewed. In turn, what is meant by data peer review is far from defined;
standards and processes encompass the full range employed in reviewing the
literature, plus some novel variations. Basic data citation is a point of
consensus, but the general agreement on the core elements of a dataset
citation frays if the data is dynamic or part of a larger set. Even as data
publication is being defined, some are looking past publication to other
metaphors, notably "data as software," for solutions to the more stubborn
problems.

Amendments from Version 1

The article type has been changed from an Opinion to a Review 
to better reflect the content of the article. We shall submit another 
version addressing reviewer comments once the article has 
received the full complement of reviews. 


Introduction: what does data publication mean? 

The idea that researchers should share data to advance knowledge 
and promote the common good is an old one, but in recent years the 
conversation has shifted from sharing data to "publishing" data.
This shift in language stems from the conviction that datasets 
should join the scholarly record and be afforded the same first class 
status as traditional research products like journal articles. While
many in the scholarly communication community share this goal, 
different people and organizations often imply different things by 
the phrase data publication. 

The community largely agrees on two essential properties of a data 
publication. First, published data is publicly available now and
for the indefinite future; access might demand payment of fees or 
acceptance of a legal agreement, but not the approval of the author. 
Second, like a book or journal article, a data publication can be for- 
mally cited. Open questions flock around a third property: how and 
to what extent a published dataset must be validated. In an effort to 
clarify the terminology, Callaghan et al. (2012) draw a distinction
between data that has been shared, published (lower-case "p"), or 
Published (upper-case "P"): shared data is available, published data 
is available and citable, and Published data is available, citable, and 
validated. In practice, availability is usually satisfied by deposit- 
ing the dataset in a repository, citability by assigning a persistent 
identifier (e.g. a Digital Object Identifier, or DOI), and validity by 
peer review. 
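
The three-level terminology can be read as a simple cumulative rule. The sketch below is only an illustration of that reading: the `Dataset` fields and the `publication_status` function are hypothetical names introduced here, not part of Callaghan et al.'s proposal or of any repository's software.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical record of the three properties discussed above."""
    available: bool   # e.g. deposited in a repository
    citable: bool     # e.g. assigned a persistent identifier such as a DOI
    validated: bool   # e.g. passed some form of peer review


def publication_status(d: Dataset) -> str:
    """Classify a dataset using the cumulative shared/published/Published terms.

    Each level presupposes the one before it, so a dataset that is
    validated but not citable is still only "shared" under this reading.
    """
    if not d.available:
        return "unshared"
    if not d.citable:
        return "shared"
    if not d.validated:
        return "published"
    return "Published"
```

For example, a dataset deposited in a repository with a DOI but without any review would be classified as "published" with a lower-case "p".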

Why publish data? 

The underlying goals of data publication are to enable research to
be reproduced and data to be reused. Hidden primary data exacerbates
science's very public "reproducibility crisis", most recently
illustrated by the collapse of a pair of irreproducible Nature articles
describing a simple method to transform somatic cells into pluripotent
stem cells. Widespread publication of the data underlying
research papers could help expose both honest errors and fraud.
The leaders of the US National Institutes of Health (NIH) recently
cited "provid[ing] greater transparency of the data that are the
basis of published manuscripts" as one way to improve scientific
reproducibility.

Journals already frequently require authors to supply underlying
data on request. In 2011, Alsheikh-Ali et al. found that 88% of
high-impact journals required a statement regarding the availability
of underlying data; half of those made willingness to provide
data a condition of publication. However, the authors of 59% of
papers examined in the study failed to adhere to the availability
instructions. Vines et al. (2014) could only obtain underlying data
from 101 of 516 papers published from 1991 to 2011. Availability
dropped off sharply with time; data could be obtained from only
two of the 62 oldest papers. Now, some journals require that underlying
data be published simultaneously with the article.

In 2010, a coalition of Ecology and Evolutionary Biology journals
began to require that the data underlying articles be archived with
a maximum embargo of one year. F1000Research has had a
similar policy (without an embargo period) since its inception, and
the Public Library of Science (PLOS) journals followed suit earlier
this year. Although there can be no substitute for funding new
experiments and data collection, appropriate data reuse lowers costs
and accelerates research. Documenting, publishing, and archiving
data is time consuming and costly, but usually far less so than
repeating the data collection. For example, Open Context published
archaeological data from a site in eastern Turkey at the substantial
cost of $10,000-15,000, but this publication expense was minor
compared to the $800,000 spent to collect the data. Piwowar (2011)
contrasted the impact of $100,000 in National Science Foundation
(NSF) grants, which generates an average of three to four papers,
with an estimate that the same investment in curating, archiving,
and publishing data could contribute to over 1,000 publications.
Furthermore, while some data is merely expensive to recreate,
time-dependent or ephemeral data (e.g. climate records or observations
of unique astronomical events) should be published because it can
never be recreated for any price.

Types of data publication 

The still-congealing phrase "data publication" covers diverse classes 
of research objects published via diverse processes. Depending on 
the speaker, a data publication might be a spreadsheet on a website, 
a set of images in an institutional archive, a stream of readings from 
a weather station transmitted over the internet, or a peer-reviewed 
article describing a dataset. Because disciplines, sub-disciplines, 
and individual researchers consider different assortments of digital 
material to be data, it is unlikely that any single structure will suit 
every discipline and dataset. But, we can hope that a manageable 
number of designs will fit most data. Five data publication models 
described by Lawrence et al. (2011) are distinguished "by how the 
roles involved in publication are distributed between the various 
actors" (e.g. the author, archive or journal). Here, we will more
simply group data publications into three categories based on the 
accompanying documentation; a dataset may supplement a traditional 
research paper, be the subject of a "data paper", or be independent of 
any paper (Figure 1). 

Data that supplements a paper 

The most familiar kind of data publication is a traditional journal
article accompanied by underlying data. That data can be hosted by
the journal as supplementary material or deposited in a third-party
repository. The trend is away from supplemental material because
repositories are considered to be better suited to ensure long-term
preservation and access to the data. For instance, The Journal of
Neuroscience stopped publishing supplemental material in 2010; the
announcement promotes disciplinary repositories as "vastly superior
to supplemental material as a mechanism for disseminating data".
Data underlying any peer-reviewed or otherwise "reputable" publication
can be deposited in the Dryad repository. Dryad makes data
available and citable, but the publisher of the article must manage
any assessment of scientific validity. Other third-party repositories
include Figshare, Zenodo, institutional repositories (e.g. the Purdue
Research Repository), and discipline-specific repositories (e.g.
DNA sequences are deposited in GenBank and protein structures
in the Protein Data Bank).

Figure 1. To be published, datasets are typically deposited in a repository to make them available and assigned an identifier to make
them citable. Some, but not all, publishers review datasets to validate them.

Data as the subject of a paper 

A data paper describes a dataset with thoroughly detailed rationale
and collection methods, but lacks any analysis or conclusions.
Data papers are flourishing as a new article type in journals such
as F1000Research, Internet Archaeology, and GigaScience, as well
as in dedicated journals like Geoscience Data Journal, Nature
Publishing Group's Scientific Data, and a trio of "metajournals" from
Ubiquity Press.

Data paper length and structure vary between journals, but the
tendency is toward a short, tightly structured format. All journals
require an abstract, collection methods, and a description of the
dataset; a few encourage authors to suggest potential uses for the
data (e.g. Internet Archaeology and Open Health Data). Some
journals supplement this general framework with field-specific
sections (e.g. Internet Archaeology and the Journal of Open
Archaeology Data each include a section for temporal and geographic
scope). Data papers are most sharply defined not by the presence
of any particular information, but by the absence of analysis or
conclusions. A crisp distinction from other article types is important
because many journals do not consider a data paper to be prior
publication if the authors seek to publish an analysis of the same
dataset (e.g. Nature-titled journals, Science, and others listed by
F1000Research).

Data journals generally limit themselves to publishing the description
of the dataset; a trusted repository publishes the data itself.
For instance, Scientific Data and Geoscience Data Journal each
direct authors to a list of approved repositories. As an exception,
GigaScience hosts data in an integrated repository named GigaDB.
An early implementer of data papers, The International Journal of
Robotics Research is unusual in that it permits authors to host
datasets on their own websites.

Data independent of any paper 

To be useful or reproducible, a dataset must be accompanied by 
descriptive information (i.e. metadata), but this need not take the
form of a journal article. Instead, some repositories publish rich, 
structured and/or freeform description together with the data. The 
distinction between a data repository and a data publisher is often 
indistinct. Repositories provide access and citability, but the degree 
of validation varies widely and few are equipped to provide peer 
review. For instance, to make data publication as easy as possible 
for authors, Figshare and Zenodo publish datasets from any field
with minimal validation. 

Availability 

Fundamentally, to publish is to make public, and to publish data is
to make data publicly available. Present availability requires
mechanisms for access; future availability also requires preservation
(e.g. long-term storage, format migration). As in print publication,
published data need not be free or legally unencumbered, and
data use agreements constrain many published datasets. If access
is limited, it should be contingent on clear and objective criteria;
writing a request to the creator for permission should not be part of
the process. For example, before granting access to restricted data,
the Inter-university Consortium for Political and Social Research
(ICPSR) judges the applicant's proposed security measures, but not
the merit of their research. Datasets from social science or clinical
studies that involve human participants are easily the most common
source of access restrictions because of the need to protect
privacy. In the United States, the Health Insurance Portability and
Accountability Act of 1996 (HIPAA) Privacy Rule severely limits
the disclosure of medical information.

As a practical matter, publishing a dataset usually includes depositing
it in a trustworthy repository. What constitutes a "trustworthy"
repository is somewhat subjective and there are a handful of
certification schemes to choose from. In 2007, the Center for Research
Libraries (CRL) published the most extensive scheme: the Trustworthy
Repositories Audit & Certification (TRAC) checklist. Many repositories
consult TRAC for self-assessment, but only four (listed by the CRL) have
completed the lengthy and rigorous process to be officially certified.
The process to obtain a Data Seal of Approval (DSA) is considerably
more streamlined. The DSA guidelines were also first released in 2007,
by the Dutch Data Archiving and Networked Services (DANS);
24 repositories have been stamped with the DSA
since then. Few of the hundreds of repositories in operation (e.g. the
973 now listed at Databib or the 609 at re3data.org) have pursued any
kind of certification. Given the low adoption of repository certification,
a more typical way to decide trustworthiness is to judge by
the organization responsible. Repositories run by governments or
large universities are likely to be considered trustworthy (although
the effects of the 2013 US government shutdown on the PubMed
biomedical article database might give one pause).

Citability 

Data citation is the element of publication that has come the farthest
toward consensus. This year, a coalition including Future Of
Research Communication and E-Scholarship (FORCE11), the
Committee on Data for Science and Technology (CODATA), and
the Digital Curation Centre (DCC) released a Joint Declaration of
Data Citation Principles. The first of the eight principles states, in
part, that "[d]ata citations should be accorded the same importance
in the scholarly record as citations of other research objects, such as
publications". Most of the time, this means that when a published
dataset contributes to a paper, it should be cited formally in the
reference list.



Data publishers enable formal citation by assigning unique permanent
identifiers, most commonly the same ones used for journal
articles: Digital Object Identifiers (DOIs). In addition to clarifying
exactly what resource is being cited, a DOI can be resolved to locate
the referenced dataset. Note, however, that a DOI is neither sufficient
nor necessary for citability: if a dataset moves and the DOI is
not updated, the citation breaks and, conversely, a well-maintained
web address works as well as a DOI.

Simple case 

The present consensus is that a dataset should be cited using, at
a minimum, five elements largely familiar from article citations:
creator(s), title, year, publisher and identifier. This format agrees
with CODATA's recommendation and conveys all the information
required to obtain a DataCite DOI or be listed in the Thomson-Reuters
Data Citation Index. However, this article-derived format
fails to address some of the complications unique to datasets,
described below.
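
To make the five-element minimum concrete, the following Python sketch assembles a citation string from exactly those elements. It is an illustration only: the function name, the ordering, and the punctuation are assumptions made here, not a format prescribed by CODATA or DataCite, and the example values (including the identifier) are invented placeholders.

```python
def format_citation(creators, title, year, publisher, identifier):
    """Join the five minimum citation elements into a single string.

    The layout "Creators (Year): Title. Publisher. Identifier" is an
    assumed, illustrative style, not an official recommendation.
    """
    elements = {
        "creators": creators,
        "title": title,
        "year": year,
        "publisher": publisher,
        "identifier": identifier,
    }
    # Refuse to build a citation if any of the five elements is missing.
    missing = [name for name, value in elements.items() if not value]
    if missing:
        raise ValueError("missing citation elements: " + ", ".join(missing))
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. {identifier}"


# Placeholder values, not a real dataset.
example = format_citation(
    creators=["Doe J", "Roe R"],
    title="Example Weather Observations",
    year=2014,
    publisher="Example Repository",
    identifier="doi:10.0000/example",
)
print(example)
```

Rejecting incomplete input is a design choice of this sketch; it reflects the point that all five elements are treated as the minimum.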

Deep citation 

The first major complication that data citation faces is the need for 
deep citation. When supporting an assertion in writing, it usually 
suffices to cite the entirety of a journal article and leave it to the 
inquisitive reader to find the relevant passage. But, to reproduce an 
analysis performed on a subset of a larger dataset, the reader needs 
to know exactly what subset was used (e.g. a limited range of dates, 
only the adult subjects, wind speed but not direction). Datasets vary 
so widely in structure that there may not be a good general solution 
for describing subsets. The most common suggestion is to cite the 
entire dataset in the reference list and describe the subset in the text 
of the paper. In straightforward cases, the Federation of Earth Science
Information Partners (ESIP) and the National Snow and Ice
Data Center (NSIDC) both recommend including a list of variables 
or range of dates in the formal citation. 

Dynamic datasets 

The second major complication arises when datasets change. In 
the past, the printing process cemented one version of an article 
as the version of record. Even for traditional scholarly literature, 
web-based publishing and preprint servers (e.g. arXiv.org) are 
complicating the situation, but datasets are especially prone to be 
dynamic. Two kinds of dynamic datasets warrant consideration: 
growing datasets that add new data while never changing or deleting
existing data, and revisable datasets where data may be added,
deleted, or changed.

Consider USC00046336, a weather station at the Oakland Museum. 
Each day, the high temperature, low temperature and amount of 
precipitation recorded at the Museum flow, together with data
from more than 20,000 other stations, into the swelling Global
Historical Climatology Network (GHCN)-Daily dataset. Or, consider
WormBase, a genome database used by the Caenorhabditis
elegans research community. WormBase encompasses genomic
sequences of C. elegans and 20 related species massively annotated
with gene structures, protein sequences, expression patterns, and a
host of other information from empirical data and computational
predictions. Every two months, WormBase responds to new data 
and better computational models by issuing a revised version with 
new material added and inaccurate material deleted or corrected. 

Additions and updates to published datasets are extremely valua- 
ble, but a researcher seeking to reproduce an analysis of a dynamic 
dataset needs access to a particular version. To enable that access, 
previous versions must be preserved and citable. Growing datasets 
can be cited with an access date or a date range in the citation, as 
recommended by ESIP and NSIDC. Revisable datasets are more 
difficult; the most common approach is to accumulate revisions and 
periodically publish a new version with a citable version number. 
For example, WormBase identifies each release with a citable ver- 
sion number and makes all of the previous versions available. 
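
The two approaches just described (an access date for growing datasets, a version number for revisable ones) can be sketched as a small helper. The function below is a hypothetical illustration; its name, parameters, and output wording are assumptions made here and do not reproduce the exact citation formats recommended by ESIP, NSIDC, or WormBase.

```python
from datetime import date
from typing import Optional


def cite_dynamic(base_citation: str, kind: str,
                 accessed: Optional[date] = None,
                 version: Optional[str] = None) -> str:
    """Append the detail a reader needs to recover the cited state.

    kind="growing":   records are only ever added, so an access date
                      is enough to bound what was used.
    kind="revisable": records may change or disappear, so a frozen,
                      numbered version is cited instead.
    """
    if kind == "growing":
        if accessed is None:
            raise ValueError("a growing dataset needs an access date")
        return f"{base_citation} Accessed {accessed.isoformat()}."
    if kind == "revisable":
        if version is None:
            raise ValueError("a revisable dataset needs a version number")
        return f"{base_citation} Version {version}."
    raise ValueError(f"unknown dataset kind: {kind!r}")
```

Under these assumptions, `cite_dynamic("Example Station Data.", "growing", accessed=date(2014, 5, 28))` yields a citation ending in "Accessed 2014-05-28.".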

Controversy persists around the specific issue of identifiers for 
dynamic datasets. DataCite recommends, but does not insist, that 
their DOIs refer to immutable objects. NSIDC and ESIP instruct
researchers to use a single identifier for growing datasets and
include the access date in the citation; each major version of a
revisable dataset gets a new identifier, but minor versions do not.
In contrast, the DCC, Dataverse, and the UK Natural Environment
Research Council (NERC) insist that any change to a dataset should
trigger a new identifier. To handle the difficulties with dynamic
data that this policy creates, the DCC recommends periodically 
issuing growing datasets a new identifier that refers to the "time- 
slice" of new records and freezing versions of revisable datasets as 
individually-identified "snapshots". 

Just-in-time identifiers 

The difficulties surrounding deep citation and dynamic data could 
potentially be solved by turning the identifier-issuing process on 
its head. Instead of the dataset publisher issuing identifiers for data 
at the level that researchers seem likely to cite, researchers could 
issue identifiers for precisely the part of the dataset that they want 
to cite. The Research Data Alliance (RDA) Data Citation Work- 
ing Group recently put forth a sophisticated proposal applicable to 
data in (or convertible to) databases. Identifiers created under this 
scheme would wrap together identification of a database, a query to 
return the cited dataset, the version of the database queried for this 
analysis, and a number of other useful components. Although this
approach is promising, many technical and policy issues must be
resolved before it can be widely adopted.
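
The general idea of an identifier that wraps together a database, a query, and a version can be illustrated with a short sketch. This is not the RDA Working Group's scheme: the function name, the record layout, and the use of a SHA-256 digest are all assumptions introduced here purely to show how such components might be bundled deterministically.

```python
import hashlib
import json


def query_identifier(database_id: str, query: str, version: str) -> str:
    """Derive a repeatable identifier for one query against one version.

    The three components are serialized in a canonical form (sorted
    keys, fixed separators) so that the same inputs always produce the
    same digest. The output format is hypothetical.
    """
    record = {"database": database_id, "query": query, "version": version}
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Keep the human-readable parts visible and shorten the digest.
    return f"{database_id}@{version}#{digest[:16]}"
```

Because the digest depends on the query text and the version, two researchers citing different subsets, or the same subset from different versions, would obtain different identifiers. A real system would also need to store the query and guarantee that old versions remain retrievable, which this sketch does not address.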

Validation 

Data validation is the least resolved aspect of data publication, and 
fundamental questions are still unanswered: What minimum level 
of quality should a published dataset guarantee? How and by what 
criteria can datasets be evaluated against that guarantee? Is litera- 
ture peer-review an appropriate model? 

Callaghan et al. (2012) draw a useful distinction between technical
and scientific review. Technical review verifies that a dataset is
complete, its description is complete, and that the two match up.
Domain expertise is generally not required, and many repositories
provide at least some level of technical review. Scientific review
evaluates the methods of data collection, the overall plausibility of
the data, and the likely reuse value. Scientific review does require
domain expertise, making this level of validation more difficult to
organize, and few repositories provide it. When data is published
with a data paper, review may be split between the repository for
technical review and the data journal for scientific review.

Data paper peer review 

Peer review guarantees that journal articles entering the scholarly 
record reach some level of validity (although the aforementioned 
reproducibility crisis calls into question exactly what that level is). 
In many fields, peer-reviewed publications enjoy a much higher 
status than any other literature. Any effort to apply the prestige of 
"publication" to datasets cascades naturally into an effort to apply 
the prestige of "peer review". But as data validation seeks to model 
itself on literature peer review, literature peer review itself is in 
flux. Open peer review at F1000Research and post-publication
commenting at PubMed Commons are just two of many ongoing 
web-enabled experiments in article evaluation. 

Journal article reviewers traditionally consider whether the methods 
used are appropriate for the questions asked and the data collected 
support the conclusions drawn. In the absence of particular ques- 
tions and conclusions, it is not obvious what peer review of data 
should certify. A dataset may be suitable for some purposes, but not 
for others. In addition, while a reviewer can be expected to read
an entire article, they cannot inspect every point in a large dataset.
Finally, researchers are already overwhelmed by peer review
of articles and may find any increased workload unreasonable.
Despite all these difficulties, venues for peer-reviewed data papers 
are opening rapidly. 

Data paper journals wrap scientific peer review of the paper and the 
dataset together into a single process. GigaScience, an exception, 
assigns technical review of the dataset to a separate data reviewer. 
The standards that various data journals provide to reviewers are 
fairly uniform, with the exception that about half consider novelty
or potential impact, while the rest only require that the dataset
be scientifically sound. While review standards are similar,
processes differ widely.

As an example, compare Biodiversity Journal and Scientific Data. 
Both journals divide reviewer guidelines into three sections along 
similar lines, which Biodiversity Journal calls "quality of the data", 
"quality of the description", and "consistency between manuscript 
and data". Scientific Data follows a traditional peer-review pro- 
cess: an editor appoints reviewers who are encouraged to remain 
anonymous. In contrast, review at Biodiversity Journal follows a 
flexible and open process featuring entirely optional anonymity and 
multiple types of reviewer. There, an editor appoints two or three 
"nominated" reviewers who must report back and several "panel" 
reviewers who read the paper and only comment at their discretion. 
Additionally, the authors may choose to open the paper to public 
comment during the review process. 

Independent data validation 

Data journals all model their data validation more or less faithfully 
on literature peer review, but independent data validation practices 
and proposals are considerably more varied. On the conservative 
end of the spectrum, Lawrence et al. (2011) propose a set of criteria
for independent data peer review. The Planetary Data System
(PDS) peer-reviews datasets through the unusual process of hold- 
ing an in-person meeting with representatives of the repository, the 
dataset creators, and the reviewers. 

Two examples from archaeology, Open Context and the Digital
Archaeological Record (tDAR), illustrate the diversity of approaches
to data validation. Open Context provides multiple validation
processes that incorporate peer review in a way that goes beyond the
simple accept/reject binary. Each Open Context dataset is rated
from one to five based not on quality per se, but on the thoroughness
of the validation; a one comes with no guarantees, a three has
passed a technical review, and a five has passed external peer review.
Whereas Open Context is a boutique publisher, focusing on data
presentation and reuse, tDAR is a large repository primarily concerned
with collecting and preserving archaeology data for
future use. tDAR is able to operate at scale by performing only 
technical validation and streamlining data deposition with a mini- 
mum of mandatory description. However, tDAR also serves as a 
platform for high-quality data publication. The repository accom- 
modates contributors who wish to provide more information, and 
much of the content is deposited by digital curators who can be 
relied on to supply rich descriptions. Furthermore, two data paper 
journals, Internet Archaeology and Journal of Open Archaeology
Data, recommend tDAR as a repository for their peer-reviewed
data. Thus, data validation depends not only on discipline and data 
type, but on a host of external factors, including the goals of the 
organizations and researchers involved. 

Pre-publication validation can be supplemented or replaced by 
post-publication feedback from successful or unsuccessful reusers. 
Parsons et al. (2010) suggest that "data use in its own right provides 
a form of review", and go on to point out that the context of reuse 
demonstrates that the data is not generically "good", but fit for some 
particular purpose. The DANS repository solicits feedback from
researchers who use its datasets: users are asked to rate the dataset
on a one to five scale in each of six criteria (e.g. data quality, quality
of the documentation, structure of the dataset).

Beyond data publication 

In a 2013 paper, Parsons and Fox argue that thinking about data
through the metaphor of print "publication" is limiting. Diverse
kinds of material are regarded as data by one research community
or another and, while at least some aspects of publication apply
well to at least some kinds of data, other approaches are possible.
An alternative metaphor that seems to be gaining traction is "data
as software". In some cases, it may be better to think of releasing
a dataset as one would a piece of software and to regard subsequent
changes as analogous to updated versions. The open-source software
community has already developed many potentially relevant
tools for working collaboratively, managing multiple versions, and
tracking attribution. Ram (2013) catalogs a multitude of scientific
uses for the software version control system Git, including data
management. Open Context uses Git and Mantis Bug Tracker to track
and correct dataset errors. Furthermore, projects such as IPython
Notebook integrate data, processing, and analysis into a single
package. However, scientific software struggles for recognition
just as data does, so using it to alter or affect the academic reward
system for data is a tricky prospect.

Ultimately, while "data as software" is promising, data is not soft- 
ware. Nor is it literature. The prestige and familiarity of terms like 
"publication" and "peer-review" are powerful, but we may have to 
stretch their definitions if we are determined to apply them to data.

The article focuses on a topic that receives a lot of interest these days. Therefore it is very timely. The
article provides a useful and valuable overview of the current state of affairs and the ongoing debate. The 
title does justice to the content of the article. So does the abstract. 

This is the second version of the article. I do not see many changes in the text based on the earlier critical 
comments made by Mark Parsons and Peter Fox. 

The overview is very informative for everyone who needs a quick introduction into the subject. I do miss 
the opinion of the authors themselves on the issues at hand and on the quoted suggestions by others. 
This would have been appropriate in a concluding paragraph. 

Detailed comments: 

1. In the Introduction I miss a clear link between data publishing and data citation and creating the
possibility for researchers to receive academic credits for their work on data. This academic credit 
is crucial as an incentive for researchers to put valuable time and effort in sharing their data. 



2. In the paragraph Why publish data? a reference to the Dutch fraud cases might be useful, as these 
cases got a lot of international attention and more or less triggered the discussion in the 
Netherlands with respect to research data management, long term preservation of data and data 
publishing and citation. A reference could be: 

Doorn P, Dillo I, van Horik R: Lies, Damned Lies and Research Data: Can Data Sharing Prevent
Data Fraud? International Journal of Digital Curation. 2013; 8(1): 229-243 
http://dx.doi.org/10.2218/ijdc.v8i1.256 

3. In the paragraph Types of data publication a threefold model is introduced to categorize data 
publications. It is not clear how this model relates to the terminology and categorisation presented 
in the introduction. This could be somewhat confusing for the reader. 

Furthermore, there are of course many other models available, e.g. that of the Data
Publication Pyramid, developed on the basis of the Jim Gray pyramid, to express the different 
manifestation forms that research data can have in the publication process: 

Reilly S, Schallier W, Schrimpf S, et al.: Report on integration of data and publications. October
2011. Located at:
http://www.stm-assoc.org/2011_12_5_ODE_Report_On_Integration_of_Data_and_Publications.pc



Or the model presented in the report: Costas, R., Meijer, I., Zahedi, Z. and Wouters, P. (2013). The 
Value of Research Data - Metrics for datasets from a cultural and technical point of view. A 
Knowledge Exchange Report, available from www.knowledge-exchange.info/datametrics 

4. With respect to trustworthy digital repositories, I would like to add a few comments. First of all, in 
Europe a European Framework for Audit and Certification of Digital Repositories is emerging. It 
contains three certification standards (DSA, DIN 31644/NESTOR seal and ISO 16363) and three
levels of certification (basic, extended and formal) see: 

http://www.trusteddigitalrepository.eu/Site/Trusted%20Digital%20Repository.html 

Of these three standards, only DSA has been up and running for some time now, with 31 seals
awarded and 30 ongoing self-assessments at this moment. The NESTOR seal has become 
available only very recently and the ISO standard is not yet officially available. The accompanying 
ISO 16919 standard: Requirements for bodies providing audit and certification of candidate 
trustworthy digital repositories, has been published very recently and now the ISO organization 
needs to be set up in the different countries, including the training of national auditors. The audits 
done by CRL are not fully official, since CRL is not a formal ISO accreditation body.

In Europe we see a growing interest in TDRs, coming from funders who want to push open data
and data sharing and demand the deposit of publicly funded data in long-term TDRs. Furthermore,
European research infrastructures and projects are also looking more and more into the issue of
trust when it comes to data sharing, and a growing number of them are incorporating (parts of) the DSA
guidelines into their repositories and policies (e.g. CESSDA, CLARIN, EUDAT).

Yet another certification procedure is offered by the ICSU/WDS to repositories that aim to become 
a member of the World Data System. See: 
https://www.icsu-wds.org/community/membership/certification 

The certification of TDRs could also help publishers/editorial boards with Data Availability Policies 
to point their authors to the right repositories for the long-term storage of their data. 

I have read this submission. I believe that I have an appropriate level of expertise to confirm that 
it is of an acceptable scientific standard. 

Competing Interests: No competing interests were disclosed. 




Mark Costello 

Institute of Marine Science, University of Auckland, Auckland, New Zealand 



Approved: 28 May 2014 

Referee Report: 28 May 2014 

doi:10.5256/f1000research.4518.r4543

Having made some similar comments myself I must agree with this review. But a few points may merit
amendment:

Abstract:

Note that what 'publication' means is the same in print as in digital form. It does not imply peer-review
or editorial oversight in either format. 

Citability: 

Yes, data citations are important but some datasets and web-based resources do not show how they 
should be cited, some journals do not allow citations to web resources in the Reference list, and even 
were both possible, too many authors neglect to cite actual datasets and instead cite a web site (which 
may have many datasets) or a related print paper. 

I agree that a DOI is not enough and is only permanent if it is updated when documents are moved. A full
author-title-publisher citation as you suggest is more informative and human readable.

I do not think it is problematic to cite parts of datasets. Pages and chapters in books are already cited for 
example. In most datasets it is also possible to identify individual data records. Also, the actual data used
could be provided in an Appendix so the reader is left in no doubt. Neither do I think 'versioning' is a 
problem. Where new data are added (e.g. to a time-series) then they comprise a new dataset, as they 
would if published in print. Where many corrections are made, a dataset can be treated like a paper;
i.e. the original can be 'retracted' and replaced, or the new version be published with the metadata stating 
that it is more accurate. 

You mention venues for peer-reviewed data papers. For this article to advance previous articles, perhaps 
it could expand on these venues and how they manage the details of the peer-review process? 

I have published a few papers you may find of interest: 

1. Costello MJ, Wieczorek J. 2014. Best practice for biodiversity data management and publication.
Biological Conservation, 173, 68-73. http://www.vliz.be/en/imis?module=ref&refid=234968 

2. Costello MJ, Appeltans W, Bailly N, Berendsohn WG, de Jong Y, Edwards M, Froese R, 
Huettmann F, Los W, Mees J, Segers H, Bisby FA. 2014. Strategies for the sustainability of online
open-access biodiversity databases. Biological Conservation 173, 155-165. 
http://www.marinebiology.ugent.be/component/imis/?module=ref&refid=230520 

3. Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P. 2013. Data should be published, 
cited and peer-reviewed. Trends in Ecology and Evolution 28 (8), 454-461.
http://dx.doi.org/10.1016/j.tree.2013.05.002

4. Costello, M.J., Vanden Berghe E. 2006. "Ocean Biodiversity Informatics" enabling a new era in 
marine biology research and management. Marine Ecology Progress Series 316, 203-214. 
http://www.int-res.com/abstracts/meps/v316/ 

5. Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V. 2012. Quality 
assurance and intellectual property rights in advancing biodiversity data publications, ver. 1.0,
Copenhagen: Global Biodiversity Information Facility, Pp. 33, ISBN: 8792020496. Accessible at
http://links.gbif.org/qa_ipr_advancing_biodiversity_data_publishing_en_v1.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that 
it is of an acceptable scientific standard.

1 Research Data Alliance, Troy, NY, USA

2 Rensselaer Polytechnic Institute, Troy, NY, USA 

Approved with reservations: 06 May 2014 

Referee Report: 06 May 2014 

doi:10.5256/f1000research.4264.r4541

General Comments: 

Note: This review was written by Parsons and accepted (with some modification) by Fox. Insights likely
come from conversations between Fox and Parsons, errors from Parsons.

I am very glad the authors wrote this essay. It is a well-written, needed, and useful summary of the current 
status of "data publication" from a certain perspective. The authors, however, need to be bolder and more 
analytical. This is an opinion piece, yet I see little opinion. A certain view is implied by the organization of 
the paper and the references chosen, but they could be more explicit. The paper would be both more 
compelling and useful to a broad readership if the authors moved beyond providing a simple summary of 
the landscape and examined why there is controversy in some areas and then used the evidence they
have compiled to suggest a path forward. They need to be more forthright in saying what data publication 
means to them, or what parts of it they do not deal with. Are they satisfied with the Lawrence et al.
definition? Do they accept the critique of Parsons and Fox? What is the scope of their essay? 

The authors take a rather narrow view of data publication, which I think hinders their analyses. They 
describe three types of (digital) data publication: Data as a supplement to an article; data as the subject of 
a paper; and data independent of a paper. The first two types are relatively new and they represent very 
little of the data actually being published or released today. The last category, which is essentially an 
"other" category, is rich in its complexity and encompasses the vast majority of data released. I was 
disappointed that the examples of this type were only the most bare-bones (Zenodo and Figshare). I 
think a deeper examination of this third category and its complexity would help the authors better 
characterize the current landscape and suggest paths forward. 

Some questions the authors might consider: Are these really the only three models in consideration or 
does the publication model overstate a consensus around a certain type of data publication? Why are 
there different models and which approach is better for different situations? Do they have different 
business models or imply different social contracts? Might it also be worthy of typing "publishers" instead 
of "publications"? For example, do domain repositories vs. institutional repositories vs. publishers address 
the issues differently? Are these models sustaining models or just something to get us through the next 
5-10 years while we really figure it out?

I think this oversimplification inhibited some deeper analysis in other areas as well. I would like to see 
more examination of the validation requirement beyond the lens of peer review, and I would like a deeper 
examination of incentives and credit beyond citation. 




F1000Research 2014, 3:94 Last updated: 11 JUL 2014



I thought the validation section of the paper was very relevant, but somewhat light. I like the choice of the 
term validation as more accurate than "quality" and it fits quite well with Callaghan's useful distinction 
between technical and scientific review, but I think the authors overemphasize the peer-review style 
approach. The authors rightly argue that "peer-review" is where the publication metaphor leads us, but it 
may be a false path. They overstate some difficulties of peer-review (No-one looks at every data value? 
No, they use statistics, visualization, and other techniques.) while not fully considering who is responsible 
for what. We need a closer examination of different roles and who are appropriate validators (not 
necessarily conventional peers). The narrowly defined models of data publication may easily allow for a 
conventional peer-review process, but it is much more complex in the real-world "other" category. The 
authors discuss some of this in what they call "independent data validation," but they don't draw any 
conclusions. 

Only the simplest of research data collections are validated only by the original creators. More often there 
are teams working together to develop experiments, sampling protocols, algorithms, etc. There are 
additional teams who assess, calibrate, and revise the data as they are collected and assembled. The 
authors discuss some of this in their examples like the PDS and tDAR, but I wish they were more 
analytical and offered an opinion on the way forward. Are there emerging practices or consensus in these 
team-based schemes? The level of service concept illustrated by Open Context may be one such area. 
Would formalizing or codifying some of these processes accomplish the same as peer-review or more? 
What is the role of the curator or data scientist in all of this? Given the authors' backgrounds, I was
surprised this role was not emphasized more. Finally, I think it is a mistake for science review to be the 
main way to assess reuse value. It has been shown time and again that data end up being used 
effectively (and valued) in ways that original experts never envisioned or even thought valid. 

The discussion of data citation was good and captured the state of the art well, but again I would have 
liked to see some views on a way forward. Have we solved the basic problem and are now just dealing 
with edge cases? Is the "just-in-time identifier" the way to go? What are the implications? Will the more 
basic solutions work in the interim? More critically, are we overemphasizing the role of citation to provide 
academic credit? I was gratified that the authors referenced the Parsons and Fox paper which questions 
the whole data publication metaphor, but I was surprised that they only discussed the "data as software" 
alternative metaphor. That is a useful metaphor, but I think the ecosystem metaphor has broader 
acceptance. I mention this because the authors critique the software metaphor because "using it to alter 
or affect the academic reward system is a tricky prospect". Yet there is little to suggest that data 
publication and corresponding citation alters that system either. Indeed there is little if any evidence that 
data publication and citation incentivize data sharing or stewardship. As Christine Borgman suggests, we
need to look more closely at who we are trying to incentivize to do what. There is no reason to assume it
follows the same model as research literature publication. It may be beyond the scope of this paper to 
fully examine incentive structures, but it at least needs to be acknowledged that building on the current 
model doesn't seem to be working. 

Finally, what is the takeaway message from this essay? It ends rather abruptly with no summary, no 
suggested directions or immediate challenges to overcome, no call to action, no indications of things we 
should stop trying, and only brief mention of alternative perspectives. What do the authors want us to take 
away from this paper? 

Overall though, this is a timely and needed essay. It is well researched and nicely written with rich 
metaphor. With modifications addressing the detailed comments below and better recognizing the 
complexity of the current data publication landscape, this will be a worthwhile review paper. With more 
significant modification where the authors dig deeper into the complexities and controversies and truly
grapple with their implications to suggest a way forward, this could be a very influential paper. It is
possible that the definitions of "publication" and "peer-review" need not be just stretched but changed or 
even rejected. 

Detailed comments: 

• The whole paper needs a quick copy edit. There are a few typos, missing words, and wrong verb 
tenses. Note the word "data" is a plural noun. E.g., Data are not software, nor are they literature. 
(NSICD, instead of NSIDC) 

• Page 2, para 2: "citability is addressed by assigning a PID." This is not true, as the authors discuss 
on page 4, para 4. Indeed, page 4, para 4 seems to contradict itself. Citation is more than a 
locator/identifier.



• In the discussion of "Data independent of any paper" it is worth noting that there may often be 
linkages between these data and myriad papers. Indeed a looser concept of a data paper has 
existed for some time, where researchers request a citation to a paper even though it is not the 
data nor fully describes the data (e.g. the CRU temp records).

• Page 4, para 1: I'm not sure it's entirely true that published data cannot involve requesting
permission. In past work with Indigenous knowledge holders, they were willing to publish summary 
data and then provide the details when satisfied the use was appropriate and not exploitive. I think 
those data were "published" as best they could be. A nit, perhaps, but it highlights that there are 
few if any hard and fast rules about data publication. 

• Page 4, para 2: You may also want to mention the WDS certification effort, which is combining 
with the DSA via an RDA Working Group: 

• Page 4, para 2: The joint declaration of data citation principles involved many more organizations
than Force11, CODATA, and DCC. Please credit them all (maybe in a footnote). The glory of the
effort was that it was truly a joint effort across many groups. There is no leader. Force11 was
primarily a convener.

• Page 4, para 6: The deep citation approach recommended by ESIP is not just to list variables or
a range of data. It is to identify a "structural index" for the data and to use this to reference subsets.
In Earth science this structural index is often space and time, but many other indices are
possible: location in a gene sequence, file type, variable, bandwidth, viewing angle, etc. It is not
just for "straightforward" data sets.

• Page 5, para 5: I take issue with the statement that few repositories provide scientific review. I can
think of a couple dozen that do just off the top of my head, and I bet most domain repositories have 
some level of science review. The "scientists" may not always be in house, but the repository is a 
team facilitator. See my general comments. 

• Page 5, para 10: The PDS system is only unusual in that it is well documented and advertised. As
mentioned, this team style approach is actually fairly common.

• Page 6, para 3: Parsons and Fox don't just argue that the data publication metaphor is limiting. 
They also say it is misleading. That should be acknowledged at least, if not actively grappled with.

We have read this submission. We believe that we have an appropriate level of expertise to
confirm that it is of an acceptable scientific standard, however we have significant reservations, 
as outlined above. 

Competing Interests: No competing interests were disclosed. 
1 Comment 

Author Response 

John Kratz, California Digital Library, USA 
Posted: 12 May 2014 

Thank you for refereeing our paper and thank you especially for delivering your report so quickly. 

We submitted the paper as a review article, not an opinion piece, and it was reclassified 
somewhere along the way. I contacted an editor at F1000 about the issue, and I believe it will be
switched back shortly. While there is undoubtedly a viewpoint inherent in the way we have 
organized the manuscript, it was our intention to deliver a timely summary of the current landscape 
as a foundation for future thinking, not to offer prescriptions or to endorse particular approaches. 
We have no shortage of opinions about data publication, and a true opinion piece may follow at 
some point, but our aim here was to remain fairly neutral. I think the paper you are asking for would 
also be valuable, but it's an entirely different paper from the one we have written. 

That said, your report is full of suggestions for expansion of analysis and clarification of scope that 
would absolutely improve the paper (e.g. the question of why some issues resist consensus more 
than others is an excellent one), and we will certainly address them in the next version. 
Competing Interests: I am an author of the selected paper. 



Article Comments 

Comments for Version 1 

Eric Kansa, Open Context (http://opencontext.org), USA 
Posted: 06 May 2014 

This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From 
my perspective in archaeology, the discussion of tDAR and Open Context is useful, since these different 
systems try to serve different needs. You may find this poster by Beth Sheehan comparing these different
systems useful as well:
http://www.slideshare.net/asist_org/rdap14-comparing-disciplinary-repositories-tdar-vs-open-context

One small point of clarification on a minor factual point. The Journal of Open Archaeological Data (JOAD)
also lists Open Context (http://opencontext.org) as a repository for data, see:
http://openarchaeologydata.metajnl.com/about/editorialPolicies#opencontext

Similarly, Internet Archaeology also lists Open Context in the same vein: 
http://intarch.ac.uk/authors/data-papers.html 

Competing Interests: I direct Open Context (see: http://opencontext.org/about/people), so I have a 
professional interest in discussions of this project. 



Konrad Hinsen, Centre de Biophysique Moleculaire (CNRS), France 
Posted: 02 May 2014 

First of all, thanks for this article, which is a good introduction to the problems surrounding data publication. 

One aspect which deserves more attention is the question "What is data?" Or, more precisely, which 
categories of data should be distinguished with respect to publication? This is related to the last paragraph 
of this article that starts with "Ultimately, while "data as software" is promising, data is not software." Data is 
indeed not software - but software is data. 

I would like to propose the following categories of scientific data: 

1. Observational data. This is the "raw input" of science: data from experiments, observations, polls,
etc. 

2. Machine-readable information generated by humans. This category includes software, input files, 
workflows, etc. Information for human consumption but also stored electronically could be included 
as well: articles, drawings, software documentation, etc. 

3. Data resulting from a computation: processed observational data, output of simulations, etc. 

Data in category 1 is not reproducible in any way, and thus needs to be archived and published. Data in 
category 2 cannot be reproduced exactly by anyone else, but could be regenerated approximately from 
less complete/precise data by a domain expert. Nevertheless, it should be archived and published as well 
in order to produce a complete and accurate record of scientific activities. Data in category 3 can be 
reproduced by computation if the data in categories 1 and 2 is available. It may be convenient to share it 
nevertheless, in particular if recomputation is expensive, but it's less fundamental than categories 1 and 2. 

I believe that these categories are more useful than the traditional separation into data, software, and 
writeup, in particular for questions such as archiving, citing, and updating. In particular, the vague term 
"dataset" does not distinguish clearly between categories 1 and 3. 
Competing Interests: none

Dear authors,

your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data 
publication". I would like to offer this as recommended reading to anybody unfamiliar with the field. 
However, there is one omission and one erroneous/misleading statement which I strongly suggest
correcting:

• In the first paragraph of "Data as the subject of a paper" you list a number of quite representative 
examples of data journals, but manage to omit the probably first example of a "pure" data journal 
(with peer review of data), ESSD, founded in 2008. 

A brief summary of ESSD's rationale and approach was published in 2011 in D-Lib Magazine,
doi:10.1045/january2011-pfeiffenberger 

• In the second paragraph of "Citability" you write "DOI is neither sufficient nor necessary for citability- 
if a dataset moves and the DOI is not updated, the citation breaks and, conversely a 
well-maintained web-address works as well as a DOI." 

I regard this as strongly misleading, at least for a novice to the domain of publishing or 
identifiers/DOIs: What typically breaks, sooner or later, is a bookmark with a "normal" URL. The DOI 
system - which I would characterize as "handle system with a policy" - was set up to work around 
that fact of life. The contracts data centers (DC) have to sign with "their" (DataCite) DOI registration 
agency typically contain wording such as: "DC has to ensure that registered content will be 
available for the entire duration of the agreement." (See "contractual form" , linked to from TIB's 
"DOI registration" page.) Admittedly, this and other such agreements are difficult to find. 

By the way, this agreement also addresses the issue of fixity: "Once an item is registered, it may not
be altered. If an item is changed, it has to be registered with a new DOI name." 

Beyond those corrections, I suggest you provide the reader with some pointers about the venues where 
the ongoing discussions about data publication issues are actually being led. E.g., there are a number of 
working and interest groups at the Research Data Alliance (not just the one on Data Citation).

best regards, 
Hans Pfeiffenberger 

Competing Interests: I happen to be the founder and chief editor of ESSD 



Chris Hartgerink, Tilburg University, Netherlands 
Posted: 30 Apr 2014

Possibly of interest to your paper is dat, a program in development to provide version control of datasets 
(more so than git is able to). It has received funding recently from the Knight Foundation (see here) and is 
something worth looking out for in terms of data sharing, but more importantly, preservation and logging. 

Thank you for writing this — it provides a succinct introduction to an important issue. 
Competing Interests: No competing interests were disclosed.